Geospatial/SAE workshop

Introduction - SAE and R

Josh Merfeld

November 2025

Introduction

General outline

  • Today: starting easy!
    • Hopes for the workshop? Goals? (done!)
  • Review of everyone’s experiences with R
  • Making sure we are all on the same page

Why small area estimation?

Geospatial data is widely available!

  • One estimate says that 100 TB of weather data alone are generated every single day
    • This means there is a lot of data to work with!
    • Note that this is also problematic, since it can be difficult to work with such large datasets
  • Geospatial data is used in a variety of fields
    • Agriculture
    • Urban planning
    • Environmental science
    • Public health
    • Transportation
    • And many more!

The amount of geospatial data is useful for SAE

  • Geospatial data can be highly predictive of e.g. poverty
    • Urbanity
    • Land class/cover
    • Vegetation indices
    • Population counts
    • etc. etc.
  • More importantly: it’s available everywhere!

Think of what you need for SAE

  • You need a sample, e.g. a household survey
    • This will only cover some of the country
  • You need auxiliary data that is:
    • Predictive of the outcome you care about
    • Available throughout the entire country
  • Some countries use administrative data
    • But, importantly, it’s often not available or is of low quality!
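
To fix ideas, here is a toy sketch of that logic using simulated data. Everything in it is made up for illustration (the variable names ndvi and poverty, the linear relationship, the 40 sampled areas), and a plain linear model stands in for a proper SAE model:

```r
# Toy illustration with SIMULATED data (not real survey or geospatial data)
set.seed(123)

n_areas <- 200
areas <- data.frame(
  area_id = 1:n_areas,
  ndvi    = runif(n_areas, 0, 1)   # hypothetical auxiliary variable, known for EVERY area
)

# "true" poverty rate, which in practice we never observe everywhere
areas$poverty <- 0.6 - 0.4 * areas$ndvi + rnorm(n_areas, sd = 0.05)

# the survey only covers some of the areas
sampled <- areas[sample(n_areas, 40), ]

# step 1: estimate the relationship using the sampled areas only
fit <- lm(poverty ~ ndvi, data = sampled)

# step 2: predict for ALL areas using the auxiliary data
areas$poverty_hat <- predict(fit, newdata = areas)

head(areas)
```

The point is only the two-step structure: the sample gives us the relationship, and the auxiliary data lets us carry it to places the sample does not cover.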

A quick example

  • Let’s take a look at Malawi

  • Why Malawi?

    • I have survey data you can use 😃
    • Only going to use part of Malawi for this example (size of data)
  • Consider the 2019/2020 Integrated Household Survey (IHS5)

    • Was used for the Malawi Poverty Report 2020
    • Can say things about poverty at the district level
    • If you want to split by urban/rural, only at the region level

A quick example

Malawi admin areas - Northern region only

  • Survey only lets us say things about the districts!
  • What if we want to say something about traditional authorities (TAs)?
  • Individual TAs might not have enough observations
  • We could use SAE! But what auxiliary data?

Census data?

  • Traditional SAE uses census data
  • The problem:
    • Some countries do not have census data
    • Others have census data that is old (how well old census data works is an area of active research!)

But geospatial data can be used!

  • Geospatial data is available more or less everywhere on earth
  • Geospatial data is updated frequently!
    • Some imagery datasets are updated daily
    • Most updated at least annually

R and RStudio

  • We need to have R and RStudio installed for what’s next
    • Another code editor is also acceptable: VS Code, for example

Goal for the session

  • The goal for today is to give you a brief introduction to R and R Markdown

  • We will be using two small datasets to get you familiar with the program

  • A note: if you are completely new to R, the first few weeks will be a slog

    • It will get better, I promise
  • Much of the material covered today comes from two (free!) sources:

What are R and RStudio?

  • R is a commonly used statistical program (and language)
    • It is free and open source, which means you can use this after graduation, without paying for it
    • R is CaSe SeNsItIvE
  • To work with R, we want to use an accompaniment called RStudio
    • RStudio is what is referred to as an integrated development environment (IDE)
    • It is not the only option (VS Code, for example), but it is the most common
    • It makes working with R much easier
  • Whenever you start R, you want to start RStudio
    • RStudio will start R for you

Some important considerations

  • One of our goals is to make reproducible research
    • This means that we want to be able to share our code and have others be able to replicate our results
    • To do this, we will use “scripts” that contain our code
  • A script should be self contained
    • This means that it should contain all of the code necessary to run the analysis
    • A well-written script should allow me to do everything without any additional information
    • Note that more complicated projects can have many scripts! For this class: one script per assignment
  • We will learn about using R Markdown to create documents
    • R Markdown is a way to combine text and code
    • This allows us to create documents that are reproducible

The RStudio interface

But we’re missing something… what is it?

The script

Some notes

  • You can add comments to your script using a hashtag (#)
    • At the top of ALL my scripts, I have a comment that says what the script does.
    • At the top of your script, write a comment. It should say “# Week 1 - Introduction to R”
    • I put LOTS of comments in my scripts. This is good practice.
Code
# cleaning the gps data and creating some maps
# to run this, just set your working directory to the folder this script is located in
# Author: Josh Merfeld
# Initial date: September 5th, 2024
Code
# lasso --------------------------------------------
# we have ~60 features. This isn't that many, actually. We didn't create a lot of different possible combinations of the predictors.
# We also don't have any fixed effects. This is just to fix ideas. Nonetheless, let's try lasso!
# we use the glmnet package to implement lasso. It also allows ridge, but we want to make sure to use lasso.
# how do we do this? we want to allocate grid cells across different "folds".

Some notes

  • You can run a line of code by clicking the “Run” button

    • There are also shortcuts. On Mac it is Command + Enter. On Windows it is Control + Enter. You can change these if you want.
  • You can run multiple lines of code by highlighting them and clicking the “Run” button (or the shortcut)

  • We will practice these later

R Basics

Object types

  • R has a few different types of objects
    • The most common are vectors, matrices, and data frames
      • A “tibble” is a type of data frame used by the tidyverse package (more below)
    • We will use data frames almost exclusively since we are working with datasets, but vectors are common, too
  • You can create a vector using the c() function:
    • Note how we create a new object using the assignment operator, <-. You can also use =.
Code
vec <- c(1, 2, 3, 4)
vec
[1] 1 2 3 4

Object types

  • You can check what type of object something is by using the class() function
    • For example, if I want to check what type of object vec is, I would write class(vec)
    • Note that the output is “numeric”
    • This is because vec is a vector of numbers
Code
vec <- c(1, 2, 3, 4)
class(vec)
[1] "numeric"
  • If I want to check whether it is a vector, I can write is.vector(vec)
    • Note that the output is TRUE
Code
is.vector(vec)
[1] TRUE
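
The other two object types mentioned earlier can be created just as easily. A small sketch with made-up values (the names mat and df are only for illustration):

```r
# a matrix: every element has the same type
mat <- matrix(1:6, nrow = 2, ncol = 3)
is.matrix(mat)   # TRUE
dim(mat)         # 2 rows, 3 columns

# a data frame: each column can have its own type
df <- data.frame(
  name = c("a", "b", "c"),
  age  = c(25, 40, 31),
  stringsAsFactors = FALSE   # keep text as character (the default in recent versions of R)
)
is.data.frame(df)   # TRUE
class(df$name)      # "character"
class(df$age)       # "numeric"
```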

First things first: the working directory

  • The working directory is the folder that R is currently working in
    • This is where R will look for files
    • This is where R will save files
    • This is where R will create files
  • You can always write out an entire file path, but this is tedious
    • More importantly, it makes your code less reproducible since the path is specific to YOUR computer
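  • One nice thing about RStudio: if you start it by opening a script file, the working directory is usually set to the folder that script is in (if RStudio is already open this may not happen, so always check)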
    • Let’s try this. Save your script to a folder on your computer, then open the script from that folder.

First things first: the working directory

The working directory should be where you opened the file from. Check it like this:

Code
getwd()
[1] "/Users/Josh/Dropbox/Papers/UN-SAE/workshops/asia/bangkokworkshop"
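
Once the working directory is right, file paths can be written relative to it. A minimal sketch, using the introdata/data.csv path that appears later in these slides (whether the file exists depends on your own computer):

```r
# relative path: R resolves it against the working directory
rel_path <- file.path("introdata", "data.csv")
rel_path

# the full path R will actually look for
full_path <- file.path(getwd(), rel_path)
full_path

# check whether the file is there before trying to read it
file.exists(rel_path)
```

Because the relative path never mentions your user name or folder layout, the same script runs on someone else's computer as long as they keep the same folder structure.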

R packages

  • R is a language that is built on packages
    • Packages are collections of functions that do specific things
    • R comes with a set of “base” packages that are installed automatically
  • We are going to use one package consistently, called the “tidyverse”
    • This consists of a set of packages that are designed to work together, with data cleaning in mind

R packages

The one exception to always using a script? I install packages in the CONSOLE. You can install packages like this:

Code
install.packages("tidyverse")
  • Note you MUST use quotes around the package name
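
If you are not sure whether a package is already installed, you can check before installing. A small sketch; is_installed() is a hypothetical helper name, not something that comes with R:

```r
# hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # "stats" ships with R, so this is TRUE

# this only prints a reminder; it does not install anything
if (!is_installed("tidyverse")) {
  message('tidyverse not found: run install.packages("tidyverse") in the console')
}
```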

Loading R packages in your script

We need to load any R packages we want to use at the very top of the script. You should have a comment on line one, so on line two write:

Code
library(tidyverse)

This will load the tidyverse package.

  • Note you do NOT need to use quotes around the package name

Loading data

  • Go to the class website and download the data for today.
    • Put it in a folder called introdata inside your WORKING DIRECTORY (where the script is); the code on this slide reads it from there
  • We will use the read_csv() function to load the data
    • This function comes from readr, one of the packages that library(tidyverse) loads
    • It will create a data frame
    • We need to NAME the object (data frame). As before, note the assignment operator (<-). You can actually use = though.
Code
library(tidyverse)

# read in the data
data <- read_csv("introdata/data.csv")

Objects in memory

The data frame should show up in the upper right hand corner of RStudio.

Objects in memory

Click on the arrow and it will show more information.

Objects in memory

  • The data frame is laid out like a matrix (unlike a true matrix, its columns can have different types)
    • Each row is an observation and each column is a variable
    • Think of what this would look like if you opened it in Excel or Stata. It’s the same.
  • We can also see the names of the columns like this:
Code
colnames(data)
 [1] "res_id"                 "ability"                "age"                   
 [4] "educyears"              "isfarmer"               "yearsfarming"          
 [7] "yearsmanagingfarm"      "outsidewage"            "worriedaboutcropprices"
[10] "worriedaboutcropyields"
  • This is the kind of thing I might do in the console since it’s not really required for the script.

Objects in memory

  • Here’s another handy quick-look function
Code
glimpse(data)
Rows: 7,209
Columns: 10
$ res_id                 <dbl> 501, 502, 503, 504, 505, 506, 507, 508, 509, 51…
$ ability                <dbl> 74, 42, 67, 54, 57, 72, 51, 65, 54, 24, 24, 49,…
$ age                    <dbl> 83, 27, 49, 50, 70, 45, 58, 41, 45, 70, 24, 45,…
$ educyears              <chr> "16", "7", "7", "7", "4", "7", "7", "7", "7", N…
$ isfarmer               <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ yearsfarming           <dbl> 60, 17, 20, 15, 40, 15, 30, 20, 20, 60, 12, 30,…
$ yearsmanagingfarm      <dbl> 46, 17, 5, 10, 26, 15, 25, 15, 10, 50, 7, 25, 5…
$ outsidewage            <dbl> 3.000e+06, 1.000e+10, 6.000e+06, 1.000e+10, 1.0…
$ worriedaboutcropprices <chr> "Sometimes", "Not at all", "Sometimes", "Not at…
$ worriedaboutcropyields <chr> "Sometimes", "Sometimes", "Not at all", "Someti…

Objects in memory

  • And one more (“structure”)
Code
str(data)
spc_tbl_ [7,209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ res_id                : num [1:7209] 501 502 503 504 505 506 507 508 509 510 ...
 $ ability               : num [1:7209] 74 42 67 54 57 72 51 65 54 24 ...
 $ age                   : num [1:7209] 83 27 49 50 70 45 58 41 45 70 ...
 $ educyears             : chr [1:7209] "16" "7" "7" "7" ...
 $ isfarmer              : chr [1:7209] "Yes" "Yes" "Yes" "Yes" ...
 $ yearsfarming          : num [1:7209] 60 17 20 15 40 15 30 20 20 60 ...
 $ yearsmanagingfarm     : num [1:7209] 46 17 5 10 26 15 25 15 10 50 ...
 $ outsidewage           : num [1:7209] 3e+06 1e+10 6e+06 1e+10 1e+10 ...
 $ worriedaboutcropprices: chr [1:7209] "Sometimes" "Not at all" "Sometimes" "Not at all" ...
 $ worriedaboutcropyields: chr [1:7209] "Sometimes" "Sometimes" "Not at all" "Sometimes" ...
 - attr(*, "spec")=
  .. cols(
  ..   res_id = col_double(),
  ..   ability = col_double(),
  ..   age = col_double(),
  ..   educyears = col_character(),
  ..   isfarmer = col_character(),
  ..   yearsfarming = col_double(),
  ..   yearsmanagingfarm = col_double(),
  ..   outsidewage = col_double(),
  ..   worriedaboutcropprices = col_character(),
  ..   worriedaboutcropyields = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Calling variables in R

  • Some of you might be used to Stata
  • One big difference between the two is that Stata generally only has one data frame in memory at a time
    • This means that you can call a variable without referencing the data frame
  • In R, if you want to look at a variable, you have to tell R which data frame it is in
    • This is done with the $ operator
    • For example, if I want to look at the variable “age” in the data frame “data”, I would write data$age
    • Let’s look at summary statistics for age:
Code
summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   34.00   42.00   43.54   52.00   87.00 

Summary statistics for the entire data frame

  • You can also use summary on the data frame instead of a single column
    • It helps to think of a data frame as rows and columns. For variables, you want to call specific columns.
  • Look at the difference here:
Code
summary(data)
     res_id        ability            age         educyears        
 Min.   : 501   Min.   : 10.00   Min.   :18.00   Length:7209       
 1st Qu.:2783   1st Qu.: 51.00   1st Qu.:34.00   Class :character  
 Median :4714   Median : 59.00   Median :42.00   Mode  :character  
 Mean   :4775   Mean   : 58.66   Mean   :43.54                     
 3rd Qu.:6764   3rd Qu.: 67.00   3rd Qu.:52.00                     
 Max.   :8955   Max.   :100.00   Max.   :87.00                     
                                                                   
   isfarmer          yearsfarming   yearsmanagingfarm  outsidewage       
 Length:7209        Min.   :-9.00   Min.   :-9.00     Min.   :2.000e+03  
 Class :character   1st Qu.:18.00   1st Qu.: 6.00     1st Qu.:3.500e+06  
 Mode  :character   Median :26.00   Median :14.00     Median :1.000e+10  
                    Mean   :28.02   Mean   :15.94     Mean   :7.156e+09  
                    3rd Qu.:38.00   3rd Qu.:22.00     3rd Qu.:1.000e+10  
                    Max.   :70.00   Max.   :70.00     Max.   :1.000e+10  
                    NA's   :219     NA's   :219       NA's   :216        
 worriedaboutcropprices worriedaboutcropyields
 Length:7209            Length:7209           
 Class :character       Class :character      
 Mode  :character       Mode  :character      
                                              
                                              
                                              
                                              

Calling rows/columns of a data frame (matrix)

  • Think about how we refer to rows and columns in a matrix.
    • We use the row and column number, in that order.
    • For example, if I want the first row and second column of a matrix \(X\), mathematically I could write \(X_{1,2}\)
  • We do the same thing in R
  • If I want the first row and second column of the data frame “data”, I would write data[1,2]
    • Note that we use square brackets instead of parentheses
    • Note that we use a comma to separate the row and column
Code
data[1,2]
# A tibble: 1 × 1
  ability
    <dbl>
1      74

Calling columns of a data frame (matrix)

  • We can call entire columns of a data frame by leaving the row blank
    • For example, if I want the second column of the data frame “data”, I would write data[,2]
    • Note that the second column is the ability variable
Code
colnames(data)
 [1] "res_id"                 "ability"                "age"                   
 [4] "educyears"              "isfarmer"               "yearsfarming"          
 [7] "yearsmanagingfarm"      "outsidewage"            "worriedaboutcropprices"
[10] "worriedaboutcropyields"
Code
data[,2]
# A tibble: 7,209 × 1
   ability
     <dbl>
 1      74
 2      42
 3      67
 4      54
 5      57
 6      72
 7      51
 8      65
 9      54
10      24
# ℹ 7,199 more rows
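
The same square-bracket logic covers most of what you will need. A sketch on a small made-up data frame (a base data frame here, not a tibble, so one detail differs, as noted in the comments):

```r
# toy data frame (made-up values)
df <- data.frame(
  id      = 1:4,
  ability = c(74, 42, 67, 54),
  age     = c(83, 27, 49, 50)
)

df[1, 2]               # first row, second column
df[1, "ability"]       # same cell, with the column referred to by name
df[1:2, ]              # first two rows, all columns
df[, c("id", "age")]   # all rows, two columns chosen by name
df[df$age > 45, ]      # only the rows that satisfy a condition

# for a base data frame, a single column comes back as a vector by default;
# drop = FALSE keeps it as a one-column data frame (a tibble always does this)
is.vector(df[, 2])
is.data.frame(df[, 2, drop = FALSE])
```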

Missing values in R

  • Missing values are denoted by NA
    • This is different from Stata, which uses a period (.)
  • Note that this is only how the PROGRAM stores missing values. Sometimes the data itself uses different codes for missing values.
  • For example, take a look at the first ten rows of the data frame (also note how I call the first ten rows and leave out the first column!):
Code
data[1:10,-1]
# A tibble: 10 × 9
   ability   age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
     <dbl> <dbl> <chr>     <chr>           <dbl>             <dbl>       <dbl>
 1      74    83 16        Yes                60                46     3000000
 2      42    27 7         Yes                17                17  9999999999
 3      67    49 7         Yes                20                 5     6000000
 4      54    50 7         Yes                15                10  9999999999
 5      57    70 4         Yes                40                26  9999999999
 6      72    45 7         Yes                15                15      800000
 7      51    58 7         Yes                30                25     2000000
 8      65    41 7         Yes                20                15  9999999999
 9      54    45 7         Yes                20                10      300000
10      24    70 <NA>      Yes                60                50  9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
#   worriedaboutcropyields <chr>
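
When a dataset uses its own code for missing, R will not know about it until you tell it. A minimal sketch on a made-up vector, ASSUMING that -9 is the missing code (always check the documentation of your own data):

```r
# toy vector; ASSUMPTION: -9 means "missing" in this made-up example
years <- c(60, 17, -9, 15, NA, 40)

is.na(years)        # TRUE only for the value R already treats as missing
sum(is.na(years))   # 1

# recode the placeholder to NA (which() skips the existing NA)
years[which(years == -9)] <- NA
sum(is.na(years))   # 2
```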

Variable types

  • R also has a few different types of variables
    • The most common are numeric, character, and logical
  • Look at the previous code again:
Code
data[1:10,-1]
# A tibble: 10 × 9
   ability   age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
     <dbl> <dbl> <chr>     <chr>           <dbl>             <dbl>       <dbl>
 1      74    83 16        Yes                60                46     3000000
 2      42    27 7         Yes                17                17  9999999999
 3      67    49 7         Yes                20                 5     6000000
 4      54    50 7         Yes                15                10  9999999999
 5      57    70 4         Yes                40                26  9999999999
 6      72    45 7         Yes                15                15      800000
 7      51    58 7         Yes                30                25     2000000
 8      65    41 7         Yes                20                15  9999999999
 9      54    45 7         Yes                20                10      300000
10      24    70 <NA>      Yes                60                50  9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
#   worriedaboutcropyields <chr>

Variable types

  • dbl is short for double, which is a numeric variable (the “type” of numeric variable is about how much memory is needed to store it)
  • chr is short for character, which is a string of characters (text)
    • Surprisingly, in our previous example, educyears was a character string even though it seemed to be a number
    • Let’s look at the possible values of educyears using the unique() function, which outputs a vector:
Code
unique(data$educyears)
 [1] "16"            "7"             "4"             NA             
 [5] "11"            "6"             "13"            "5"            
 [9] "8"             "10"            "12"            "9"            
[13] "2"             "3"             "15"            "14"           
[17] "20"            "18"            "17"            "1"            
[21] "Not Mentioned" "19"           

Variable types

  • Interesting! It seems that there is a “Not Mentioned” value.
    • What if we want to replace those with missing, instead?
  • Let’s talk through the following code
    • First note how it refers to a specific column and then a specific row
    • Also note how it uses two equal signs (==) to check whether the value is “Not Mentioned”
      • This is similar to Stata!
Code
# replace "Not Mentioned" with NA
data$educyears[data$educyears=="Not Mentioned"] <- NA  
# check that it worked by looking at the unique values
unique(data$educyears)              
 [1] "16" "7"  "4"  NA   "11" "6"  "13" "5"  "8"  "10" "12" "9"  "2"  "3"  "15"
[16] "14" "20" "18" "17" "1"  "19"
Code
# turn into numeric
data$educyears <- as.numeric(data$educyears)
class(data$educyears)
[1] "numeric"
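
Another option is to declare the extra missing code when the file is read, so the column never becomes character in the first place. The sketch below uses base R's read.csv() and its na.strings argument on a tiny made-up file; read_csv() has a similar argument called na.

```r
# write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("res_id,educyears",
             "501,16",
             "502,Not Mentioned",
             "503,7"), tmp)

# tell R up front which strings mean "missing"
toy <- read.csv(tmp, na.strings = c("NA", "Not Mentioned"))

class(toy$educyears)   # "integer": a number, not character
toy$educyears          # 16 NA 7
```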

Pipes

  • One of the most useful things in R is the pipe operator (|>)
    • This is built into base R (version 4.1 and later); the tidyverse also has its own, older pipe, %>%, which works in much the same way
    • It allows you to chain commands together
    • It makes your code much easier to read, write, debug, share, and reproduce
  • It’s easy to use but it will take some time for you to get used to the names of the functions we can use with it
    • This also goes for other tasks in R, not just with the pipe operator
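
A small sketch of what chaining means, using a made-up vector and three basic functions. Both versions compute the same thing; the piped one just reads in the order the steps happen:

```r
x <- c(4, 9, 16, 25)

# nested version: read from the inside out
round(sqrt(mean(x)), 2)   # 3.67

# piped version: read from top to bottom
x |>
  mean() |>
  sqrt() |>
  round(2)                # 3.67
```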

Pipes example

Here is an example of how we can use pipes with the mutate() function in tidyverse

  • We are also going to use ifelse() to make this work
Code
data <- data |>
          mutate(educyears = ifelse(educyears == "Not Mentioned", NA, educyears), # if educyears == "Not Mentioned", replace it with NA
                 educyears = as.numeric(educyears))    # then convert educyears to numeric (instead of character)
summary(data$educyears)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   7.000   7.000   6.735   7.000  20.000    3113 

Note that we could wrap as.numeric() around the ifelse() command to do it on one line!

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
summary(data$educyears)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   7.000   7.000   6.735   7.000  20.000    3113 

Missings and functions in R

In Stata, by default, functions ignore missing values

  • R does not do this by default. Look at this:
Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
mean(data$educyears)
[1] NA

If there are any missing values, the function will evaluate to missing!

  • But we can also do this:
Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
mean(data$educyears, na.rm = TRUE) # BE CAREFUL WITH THIS! Make sure it is indeed what you want to do.
[1] 6.735107

Functions and storing values

The mean() function in the previous slide outputs a single value. That means we can store that value as an object:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
meaneduc <- mean(data$educyears, na.rm = TRUE)
sdeduc <- sd(data$educyears, na.rm = TRUE)
meaneduc
[1] 6.735107
Code
sdeduc
[1] 2.404086

How is this helpful? We can use these values later in our script!

Functions and mutate()

We can combine the mean() and sd() functions within mutate to create a new, standardized variable:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)), # wrapped into one line
                 educyears_std = (educyears - mean(educyears))/sd(educyears))
summary(data$educyears_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA    7209 

Oh no! What happened?

Functions and mutate()

The missing values are the problem: mean() and sd() return NA unless we add na.rm = TRUE to both:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)), # wrapped into one line
                 educyears_std = (educyears - mean(educyears, na.rm = T))/sd(educyears, na.rm = T))
summary(data$educyears_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-2.3856  0.1102  0.1102  0.0000  0.1102  5.5176    3113 

Note that we can shorten TRUE to T (or FALSE to F). Spelling them out is safer, though, since T and F are ordinary variables that can be overwritten.
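
To see what standardizing does, write it as \(z_i = (x_i - \bar{x}) / s\), where \(\bar{x}\) is the mean and \(s\) the standard deviation of the non-missing values. A quick check on a made-up vector confirms the result has mean zero and standard deviation one:

```r
x <- c(16, 7, 7, NA, 4, 11)   # made-up values with one missing

z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

mean(z, na.rm = TRUE)   # 0 (up to rounding error)
sd(z, na.rm = TRUE)     # 1
is.na(z)                # the missing value stays missing
```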

Visualizations with ggplot2

  • ggplot2 is a flexible way to create visualizations in R
  • The basic idea is that you create a plot object and then add layers to it
  • Let’s create a histogram of educyears

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# we call ggplot() and NOT ggplot2()
ggplot() +   # note how we use + here, NOT the pipe operator
  geom_histogram(data = data, aes(x = educyears)) # the histogram with geom_histogram
# data = data tells R to use the data frame "data", and aes() sets the aesthetics (which variable goes where)
# only an x value here since a histogram uses just a SINGLE variable

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# we can save the plot as an object
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears))
g1

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# lots of ways to change the plot
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears)) +
        labs(
          title = "Histogram of educyears",
          x = "Years of education",
          y = "Count"
        )
g1

One more example

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears)) +
        labs(
          title = "Histogram of educyears",
          x = "Years of education",
          y = "Count") +
        theme_bw()
g1
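
A variation you will often see: the data and aes() can go inside ggplot() itself, and the layers added afterwards inherit them. The sketch below uses a made-up data frame called toy, and the binwidth and file name are just illustrative choices:

```r
library(ggplot2)

# made-up data
toy <- data.frame(educyears = c(1, 4, 7, 7, 7, 7, 11, 16, NA))

# data and aes() set once in ggplot(); geom_histogram() inherits both
g2 <- ggplot(toy, aes(x = educyears)) +
  geom_histogram(binwidth = 1) +   # one bar per year of education
  theme_bw()

# to save the plot to a file (hypothetical file name), uncomment:
# ggsave("educyears_hist.png", g2, width = 6, height = 4)
```

Printing g2 will warn that one row was removed; that is the NA.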

Let’s try this with a NEW dataset

First install a new package that has a dataset we will use (you can do this in the console):

Code
install.packages("nycflights13")

Now let’s see:

Code
library(nycflights13)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Dates with lubridate

There’s a nice package called lubridate that makes working with dates much easier.

Code
library(lubridate)
# create a date variable
flights$date <- as_date(paste0(flights$year, "-", flights$month, "-", flights$day))
head(flights$date)
[1] "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01"
[6] "2013-01-01"

Dates with lubridate

Departure and arrival times are in the format HHMM (e.g., 1530 is 3:30pm). We can turn this into hours and minutes, which could then be combined with the date.

Code
flights$dep_time_new <- hm(paste0(flights$dep_time %/% 100, ":", flights$dep_time %% 100))
head(flights$dep_time_new, n = 20)
 [1] "5H 17M 0S" "5H 33M 0S" "5H 42M 0S" "5H 44M 0S" "5H 54M 0S" "5H 54M 0S"
 [7] "5H 55M 0S" "5H 57M 0S" "5H 57M 0S" "5H 58M 0S" "5H 58M 0S" "5H 58M 0S"
[13] "5H 58M 0S" "5H 58M 0S" "5H 59M 0S" "5H 59M 0S" "5H 59M 0S" "6H 0M 0S" 
[19] "6H 0M 0S"  "6H 1M 0S" 
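
If what you want is a single date-time, one option is lubridate's make_datetime(), which builds it directly from the pieces. A sketch with made-up values in the same HHMM layout; note the time zone defaults to UTC, so set tz yourself if it matters:

```r
library(lubridate)

# made-up values laid out like year, month, day, and dep_time in flights
year     <- c(2013, 2013)
month    <- c(1, 1)
day      <- c(1, 2)
dep_time <- c(517, 1530)

dep_datetime <- make_datetime(
  year, month, day,
  hour = dep_time %/% 100,   # integer division: 517 %/% 100 is 5
  min  = dep_time %% 100     # remainder:        517 %% 100 is 17
)
dep_datetime
```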

One more example

Lubridate also lets us work with “periods”

Code
flights$dep_delay_new <- as.period(flights$dep_delay, unit = "minute")
# NOTE: You have to be very careful with taking means/medians, etc.
head(flights$dep_delay_new)
[1] "2M 0S"  "4M 0S"  "2M 0S"  "-1M 0S" "-6M 0S" "-4M 0S"

Let’s look at some new tidyverse functions

Let’s get the average departure delay by NYC airport:

Code
# Remember I said be careful with means of periods/durations! Using the original value here.
flights |> 
    group_by(origin) |> # this groups ROWS based on their origin value
    summarize(avg_dep_delay = mean(dep_delay, na.rm = T)) # this summarizes the data, creating means based on the grouping!
# A tibble: 3 × 2
  origin avg_dep_delay
  <chr>          <dbl>
1 EWR             15.1
2 JFK             12.1
3 LGA             10.3

Note that this does not create a single value. Instead it creates a tibble (a data frame) summarizing the data by our grouping variable.
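
summarize() can also create several columns at once, which is handy for checking how many rows (and how many missing values) sit behind each mean. A sketch on a small made-up data frame, assuming dplyr is installed:

```r
library(dplyr)

# made-up data
toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  dep_delay = c(10, NA, 5, 15, 0)
)

toy_summary <- toy |>
  group_by(origin) |>
  summarize(
    n_flights     = n(),                     # rows per group
    n_missing     = sum(is.na(dep_delay)),   # how many delays are missing
    avg_dep_delay = mean(dep_delay, na.rm = TRUE)
  )
toy_summary
```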

Let’s look at some new tidyverse functions

What if we want to save that tibble instead?

Code
summat <- flights |> 
            group_by(origin) |> # this groups ROWS based on their origin value
            summarize(avg_dep_delay = mean(dep_delay, na.rm = T)) # this summarizes the data, creating means based on groups!
summat # print the 3 x 2 tibble in the console
# A tibble: 3 × 2
  origin avg_dep_delay
  <chr>          <dbl>
1 EWR             15.1
2 JFK             12.1
3 LGA             10.3

I could then output this to a table if I wanted to (using Markdown, more on this later):

origin   avg_dep_delay
EWR      15.10795
JFK      12.11216
LGA      10.34688

Let’s look at a new plot

How does departure delay vary by time of day?

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay))

Let’s look at a new plot

We can color code by origin, too!

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay, color = origin))

Make it prettier

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay, color = origin), se = FALSE) +
  labs(
    x = "Scheduled departure time",
    y = "Departure delay (minutes)") +
  theme_bw() + guides(color = guide_legend(title = "Departure airport"))